One of the most important aspects of Awair’s backend system is scalably storing and fetching time series data (e.g. sensor data from devices). For this, we use Google Cloud Platform’s BigTable managed NoSQL database.
The BigTable documentation recommends keeping storage utilization below 70% on clusters that serve high traffic. BigTable needs that headroom to redistribute tablets across the nodes in the cluster, dynamically balancing them so that each node carries a similar load. We recently exceeded 95% usage on our cluster, so we scaled from 5 to 6 nodes to add storage capacity. This triggered a tablet re-balance, but BigTable had very little free space in which to perform it.
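As a rough illustration of the sizing math involved, the minimum node count for a given storage footprint and utilization target can be estimated as below. The per-node capacity figure is an assumption for the example, not a number from this incident; check the current BigTable documentation for actual per-node limits.

```python
import math

def min_nodes(storage_tb: float, per_node_tb: float = 2.5,
              target_utilization: float = 0.70) -> int:
    """Estimate the minimum BigTable node count needed to keep storage
    utilization at or below the target.

    per_node_tb is the storage served per node; 2.5 TB (SSD) is an
    assumed figure for illustration only.
    """
    return math.ceil(storage_tb / (per_node_tb * target_utilization))

# Example: ~11 TB of data at 2.5 TB/node needs 7 nodes to stay <= 70%.
print(min_nodes(11.0))  # -> 7
```

The point of the ceiling division is that each added node buys a fixed slice of capacity, so dropping from 95% utilization back under 70% can take more than one extra node.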
While rebalancing, the hottest node in the cluster reached 100% CPU utilization, while overall cluster CPU utilization fell below 20%. During the incident, BigTable read throughput shrank to 1 megabyte per second and writes were blocked entirely. On the write subscription that feeds BigTable, undelivered messages grew to 80 million and the oldest unacknowledged message age climbed to 4,000 seconds.
We added one more node to bring storage usage below the suggested 70% limit and to give the cluster more resources for the re-balance. During recovery, reads peaked at 400 thousand rows per second and the backlog of pending writes flushed completely. No data was lost, and the cluster has returned to good health.
We will add internal alerting on BigTable storage utilization to warn us when it climbs above 70% so that this issue does not recur.
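A minimal sketch of the check behind that alert. In practice the utilization ratio would come from a monitoring metric such as Cloud Monitoring’s bigtable.googleapis.com/cluster/storage_utilization; the function and example numbers here are illustrative, not our production alerting code.

```python
def storage_alert(used_bytes: int, capacity_bytes: int,
                  threshold: float = 0.70) -> bool:
    """Return True when storage utilization exceeds the alert threshold.

    Illustrates the 70% rule only; real alerting would read the
    utilization ratio from the cluster's monitoring metrics.
    """
    return used_bytes / capacity_bytes > threshold

# 11 TB stored on a cluster with 15 TB of capacity (~73% utilization):
print(storage_alert(11 * 10**12, 15 * 10**12))  # -> True
```

Alerting at 70% rather than at near-full gives the cluster time to add nodes while there is still headroom to re-balance tablets.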