Building Solid Demos . . . and what I learnt from it — Dataflow

I was building a pipeline partially for interest and partially for a solid demo into streaming data pipelines. Having a poke around the pipeline and forcing it to behave in certain ways ended up giving me a great deal of insight into one of the most powerful pieces of plumbing in GCP: Dataflow.

Setup

I then wrote a connector that streams the socket data to Pub/Sub. Its 15 line of code, just read from the socket and emit each line as a message to the Pub/Sub interface. The client library automatically batches (which introduces a bit of latency which we’ll see the effect of in the processing pipeline.

The dataflow job is custom as it connects to the subscription and not the default of the topic because while the pipeline was in development I didn’t want to lose data. I could just detach the dataflow job, re-write and re-attach the pipeline and just catch up on the messages. If I was using the topic approach, messages between the detach and re-attach operations would have been lost (unless I had another pipeline catching all the data).

Now, I wanted to put pressure on the pipeline so we could see some of the interesting features of the technology. I had initially deployed using the default of an n1-standard-4 machine (4 x vCPU and 3.75 GB RAM) and it handled the load without breaking a sweat. Pipeline lag was about 10 seconds (including transmission, Pub/Sub, DataFlow and BigQuery). That was not great, so I scaled down to an n1-standard-2 machine and it handled the pipeline almost as well, peaking at ~80% CPU on a single node. Not great, so down to n1-standard-1 and I got peaks at about 95% CPU so far, but no auto scaling. So I have a colleague who is going to run this next to their airport and add to the load and we should see scaling.

The two elements worth seeing though are the Pub/Sub buffering when I swapped from n1-standard-2 to n1-standard-1 machines. You can clearly see the spike in messages in the middle of the graph, as well as the increase unacked message count (the lag introduced by having a too small machine on the DataFlow side).

Unacked messages (unprocessed messages) and oldest message (lag indicator) with the spike during the switch to small nodes in DataFlow and the increased lag after the change.

Yes, I was trying to force this. Once it scales I’ll update this post to show that as well.

Also, one of the strange things is due to lockdown we actually have quiet periods where there are no flights overhead, which is weird, especially since the airport was one of the top 10 busiest airports in Europe. This caused an interesting lag element where the Pub/Sub messages would be batched through. Here are the graphs showing the lag vs message volume.

System lag
Message Volume

You can clearly see that when I have idle rate messages I have a much higher system lag. In production I would potentially reduce my batch size to counter this but there is a trade-off. My pipeline would then be optimized on the low message count side and not the high message count side. Engineering is always around trade-offs. Building this demo has given me another insight into another engineering decision around pipeline performance.

Senior Google Cloud Trainer