Mobile app APIs unresponsive

Incident Report for AWAIR

Postmortem

One of the most important aspects of Awair’s backend system is the ability to scalably store and fetch time series data (e.g. sensor data from devices). For this, we use Google Cloud Platform’s BigTable, a managed NoSQL database.

The BigTable documentation suggests a maximum storage utilization of 70% for clusters that handle high traffic. This is because BigTable needs free space to distribute tablets across the nodes in the cluster, dynamically rebalancing them to keep each node at a similar load. Our cluster recently reached over 95% usage, so we increased the number of nodes from 5 to 6 to add storage. This caused BigTable to rebalance its tablets, but it had little free space with which to do so.

While rebalancing, the hottest node in BigTable reached 100% CPU utilization, and the overall CPU utilization of the BigTable cluster dropped below 20%. During the incident, BigTable read throughput shrank to 1 megabyte per second and writes were blocked. For the write subscription that feeds BigTable, the number of undelivered messages grew to 80 million and the age of unacknowledged messages climbed to 4000 seconds.

We added one more node to bring storage usage below the suggested 70% limit and to give the cluster more resources for the rebalance. During recovery, rows read from BigTable peaked at 400 thousand per second and the pending write requests were flushed completely. No data was lost and the cluster has returned to good health.
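
The node arithmetic behind this step can be sketched as follows. This is a rough illustration rather than the calculation used during the incident. It assumes that storage capacity scales linearly with node count and that the stored data volume stays constant during the resize, and it takes 95% as the starting utilization even though the real figure was somewhat higher. The function names are hypothetical.

```python
import math


def projected_utilization(current_util: float, current_nodes: int, new_nodes: int) -> float:
    """Storage utilization after a resize.

    Assumption: capacity grows linearly with node count and the amount
    of stored data does not change while resizing.
    """
    return current_util * current_nodes / new_nodes


def min_nodes_for_target(current_util: float, current_nodes: int, target_util: float) -> int:
    """Smallest node count that keeps utilization at or below target_util."""
    return math.ceil(current_util * current_nodes / target_util)


if __name__ == "__main__":
    # 95% on 5 nodes is the approximate starting point from the report.
    print(f"{projected_utilization(0.95, 5, 6):.1%}")  # 79.2%, still above 70%
    print(f"{projected_utilization(0.95, 5, 7):.1%}")  # 67.9%, below 70%
    print(min_nodes_for_target(0.95, 5, 0.70))         # 7
```

Under these assumptions, going from 5 to 6 nodes leaves the cluster above the suggested limit, which is consistent with the rebalance having little room to work with until the seventh node was added.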

We will add internal alerting for BigTable to warn us when storage utilization climbs above 70% so that this issue does not occur again.
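
A minimal sketch of such a threshold check is shown below. It is not Awair's actual alerting setup. The function name and the 90% critical level are assumptions made for illustration, and only the 70% warning level comes from this report. How the utilization value is collected (for example, from a monitoring system) is left out.

```python
from typing import Optional

WARN_THRESHOLD = 0.70      # suggested limit cited in this report
CRITICAL_THRESHOLD = 0.90  # hypothetical second level, not from the report


def storage_alert_level(utilization: float) -> Optional[str]:
    """Map a storage utilization fraction (0.0 to 1.0) to an alert level.

    Returns None when utilization is at or below the warning threshold.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError(f"utilization must be between 0 and 1, got {utilization}")
    if utilization > CRITICAL_THRESHOLD:
        return "critical"
    if utilization > WARN_THRESHOLD:
        return "warning"
    return None


if __name__ == "__main__":
    for value in (0.65, 0.72, 0.95):
        print(value, storage_alert_level(value))
```

With a check like this in place, a warning would have fired well before the cluster reached the 95% level described above.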

Posted Jun 21, 2019 - 10:50 PDT

Resolved

All API operations are back to normal. The root cause is believed to be high CPU utilization on the BigTable nodes that host sensor data, caused by data rebalancing after an additional node was added to the BigTable cluster. We will follow up with a proper postmortem.

Posted Jun 19, 2019 - 23:38 PDT

Update

Queued-up sensor data is still being flushed and written into BigTable. There might be minor latency in showing current sensor data in the mobile app and other interfaces.

Posted Jun 19, 2019 - 23:18 PDT

Update

API latencies have gone down significantly and most services are back to normal. Still monitoring.

Posted Jun 19, 2019 - 22:44 PDT

Monitoring

CPU utilization of the hot node in the BigTable cluster is going down, and API latencies are also going down. Still monitoring.

Posted Jun 19, 2019 - 22:34 PDT

Update

One of the nodes in the BigTable cluster is showing very high CPU utilization and seems to be the root cause of this issue. Working to mitigate the issue.

Posted Jun 19, 2019 - 22:33 PDT

Update

Components in the developer APIs and dashboard that fetch sensor data show the same high latency, because a central internal component is showing high latency.

Posted Jun 19, 2019 - 22:30 PDT

Identified

A potential cause of this issue has been identified. The internal APIs for fetching time series data are believed to be causing the mobile app APIs to be unresponsive. Working to mitigate this issue.

Posted Jun 19, 2019 - 22:16 PDT

Investigating

Mobile app APIs have been unresponsive since around 9:10 PM PDT. Looking into this issue.

Posted Jun 19, 2019 - 22:13 PDT

This incident affected: Developer APIs (Dashboard Developer APIs), Enterprise Dashboard, and Mobile Apps (Awair Business App).