Using BigQuery Public Datasets for Small Projects

Paul Leroy
4 min read · Aug 18, 2020
Where my students are (or said) they’re from this week

I have written a small app to demonstrate integration between Google Sheets and App Engine, essentially using a sheet as a database for the app. It is not the fastest or best approach, but it demonstrates the split between user and server space quite nicely. Recently I decided to make it a little more interactive: I attached a form and allowed people to tell me where they're from. The problem is that I had assumed people would enter a country when asked for their country. They didn't. Many people are very proud of their cities, so I needed to add 'New York' to the list. A simple lookup should solve that. Also, 'America', 'US' and 'USA' are all equivalent to 'United States of America'. Another entry. Job done. I'll just add the cities that don't match as people put them in, right?
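That lookup logic amounts to a small alias table. A minimal sketch in Python (the function name and the alias entries here are illustrative, not the app's actual code):

```python
# Illustrative alias lookup: map free-form answers to a canonical
# country name, with a city-to-country table for entries like 'New York'.
COUNTRY_ALIASES = {
    "america": "United States of America",
    "us": "United States of America",
    "usa": "United States of America",
}

CITY_TO_COUNTRY = {
    "new york": "United States of America",
}

def canonical_country(answer):
    """Return the canonical country for a form answer, or None."""
    key = answer.strip().lower()
    return COUNTRY_ALIASES.get(key) or CITY_TO_COUNTRY.get(key)
```

The trouble, of course, is that tables like these only grow as fast as you can add entries by hand, which is what sent me looking for a proper list of cities.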

I decided to poke around in the BigQuery public datasets for all the cities in the world. Cities have a well-understood definition and need to be above a certain size.

The first thing I needed was to get all the cities’ point information. So I went and found the public OpenStreetMap dataset.

$ bq --project_id=bigquery-public-data ls geo_openstreetmap
       tableId        Type    Labels   Time Partitioning     Clustered Fields
 ------------------- ------- -------- ------------------- ----------------------
  history_changesets   TABLE
  history_nodes        TABLE
  history_relations    TABLE
  history_ways         TABLE
  planet_changesets    TABLE
  planet_features      TABLE
  planet_layers        TABLE                                layer_code, geometry
  planet_nodes         TABLE
  planet_relations     TABLE
  planet_ways          TABLE

Great. I was fairly sure the data was sitting in `planet_features`. The data is tagged, and uses osm_id as the unique identifier. What I wanted was each city's point (latitude and longitude) and its name, in English if possible. The tags are repeated key/value pairs, which contain the name, the name in English (name:en), the population and various other attributes. The city geometry can be a point or a polygon, depending on the size. As the data presentation doesn't need to be too accurate, I decided to just use the centroid. My query for the city data then became:

with prepared_data as (
  SELECT
    osm_id,
    all_tags,
    ST_CENTROID(geometry) AS geom
  FROM
    `bigquery-public-data.geo_openstreetmap.planet_features`
  WHERE
    EXISTS (
      SELECT
        key,
        value
      FROM
        UNNEST(all_tags)
      WHERE
        key = 'place'
        AND value = 'city'))

This gave me every city, with all its tags and its geometry. Next I needed to filter the data so I could use name:en if available, or name if not. I also wanted to pivot the data a bit (no hardcore dynamic pivoting here, I'm afraid; I knew exactly which columns I wanted). I grouped the data by osm_id, took `ANY_VALUE` for the geometry, and took the non-null values for name and name:en:

SELECT
  osm_id,
  MAX(IF(key = 'name', value, NULL)) AS name,
  MAX(IF(key = 'name:en', value, NULL)) AS name_en,
  ANY_VALUE(geom) AS geom
FROM
  prepared_data
CROSS JOIN
  UNNEST(all_tags) AS tags
WHERE
  key IN ('name',
    'name:en',
    'population')
GROUP BY
  osm_id
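The MAX(IF(...)) pattern is a common way to pivot repeated key/value pairs into columns. In Python terms, it does roughly this for each osm_id (a sketch for illustration, not part of the pipeline):

```python
# Rough Python equivalent of the MAX(IF(...)) pivot: for one osm_id,
# turn repeated (key, value) tags into name / name_en columns.
def pivot_tags(tags):
    names = [v for k, v in tags if k == "name"]
    names_en = [v for k, v in tags if k == "name:en"]
    # MAX keeps the (lexicographically) largest non-null value; with at
    # most one tag per key, that is simply the value itself.
    return {
        "name": max(names) if names else None,
        "name_en": max(names_en) if names_en else None,
    }
```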

I also wanted to convert the geometry into latitude and longitude floating-point values that Google Sheets could use for the lookup, and to use name_en if it existed, falling back to name otherwise. Finally, I added a small piece that would give me a Google Maps link, so I could check that the data was being materialized correctly before importing it into Sheets. That gave me the full query:

WITH
  prepared_data AS (
    SELECT
      osm_id,
      all_tags,
      ST_CENTROID(geometry) AS geom
    FROM
      `bigquery-public-data.geo_openstreetmap.planet_features`
    WHERE
      EXISTS (
        SELECT
          key,
          value
        FROM
          UNNEST(all_tags)
        WHERE
          key = 'place'
          AND value = 'city'))
SELECT
  IF(name_en IS NOT NULL, name_en, name) AS name,
  ST_X(geom) AS longitude,
  ST_Y(geom) AS latitude,
  CONCAT('https://www.google.com/maps/@',
    CAST(ST_Y(geom) AS STRING), ',',
    CAST(ST_X(geom) AS STRING), ',16z') AS maps_link
FROM (
  SELECT
    osm_id,
    MAX(IF(key = 'name', value, NULL)) AS name,
    MAX(IF(key = 'name:en', value, NULL)) AS name_en,
    ANY_VALUE(geom) AS geom
  FROM
    prepared_data
  CROSS JOIN
    UNNEST(all_tags) AS tags
  WHERE
    key IN ('name',
      'name:en',
      'population')
  GROUP BY
    osm_id)
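The outer SELECT's IF(...) is just a null-coalescing fallback, and the CONCAT builds a Google Maps URL from the centroid. The same two steps in Python, for illustration (the function names are mine, not from the app):

```python
def display_name(name, name_en):
    # IF(name_en IS NOT NULL, name_en, name): prefer the English name.
    return name_en if name_en is not None else name

def maps_link(lat, lon):
    # Mirrors the CONCAT in the query: a Maps URL centered on the point
    # at zoom level 16, latitude first.
    return "https://www.google.com/maps/@{},{},16z".format(lat, lon)
```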

The query costs around $1 and returns just over 18,100 cities that I could then use in my visualisation. A quick sanity check on coverage using BigQuery GeoViz shows pretty good coverage. I decided against adding a population limit, as the OSM 'city' definition should already give me a good lower bound on the data.

City Coverage

Now I could carry on with the D3 visualization of the Sheets data in App Engine, because I had the data to support it.
