One of the most important aspects of Awair’s backend system is the ability to store and fetch time series data (e.g. sensor data from devices) at scale. For this, we use Google Cloud Platform’s BigTable managed NoSQL database.
We added one more node to bring storage usage below the suggested 70% limit and to give the cluster more resources for the rebalance. During recovery, rows read in BigTable peaked at 400 thousand per second and the pending write requests were flushed completely. No data was lost, and the cluster has returned to good health.
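As a rough illustration of the sizing arithmetic behind the 70% guideline, the sketch below computes the smallest node count that keeps storage utilization at or below a limit. The function name, the per-node capacity, and the data size are hypothetical values chosen for the example; they are not figures from our cluster.

```python
import math


def min_nodes_for_storage(data_tb: float,
                          per_node_capacity_tb: float,
                          limit: float = 0.70) -> int:
    """Smallest node count that keeps storage utilization at or below `limit`.

    All inputs are illustrative assumptions, not measured cluster values.
    """
    if data_tb < 0 or per_node_capacity_tb <= 0 or not 0 < limit <= 1:
        raise ValueError("invalid sizing inputs")
    # Each node may only be filled up to `limit` of its raw capacity.
    usable_per_node_tb = per_node_capacity_tb * limit
    # A cluster always has at least one node.
    return max(1, math.ceil(data_tb / usable_per_node_tb))


# Hypothetical example: 7.5 TB of data on nodes with 2.5 TB of capacity each.
nodes = min_nodes_for_storage(7.5, 2.5)
```

With these assumed numbers, three nodes would hold the data at 100% utilization, but staying under the 70% limit takes more nodes, which is the kind of headroom the extra node was meant to restore.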
We will add internal alerting for BigTable to warn us when storage utilization climbs above 70%, so that this issue does not occur again.
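A minimal sketch of the kind of threshold check such an alert could perform is shown below. It assumes the used and total storage values are already available from monitoring; the function name and the message format are hypothetical and do not describe our actual alerting setup.

```python
from typing import Optional


def storage_alert(used_bytes: float,
                  capacity_bytes: float,
                  threshold: float = 0.70) -> Optional[str]:
    """Return a warning message when utilization exceeds `threshold`, else None."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    utilization = used_bytes / capacity_bytes
    if utilization > threshold:
        return (f"WARNING: storage utilization {utilization:.0%} "
                f"exceeds {threshold:.0%}")
    return None


# Hypothetical readings: 75 of 100 units used triggers a warning, 50 does not.
over = storage_alert(75, 100)
under = storage_alert(50, 100)
```

In practice a check like this would run on a schedule and route the message to an on-call channel, so that capacity can be added before a rebalance coincides with peak load.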
Posted Jun 21, 2019 - 10:50 PDT
Resolved
All API operations are back to normal. The root cause is believed to be high CPU utilization on the BigTable nodes that host sensor data, caused by data rebalancing after an additional node was added to the BigTable cluster. We will follow up with a proper postmortem.
Posted Jun 19, 2019 - 23:38 PDT
Update
Queued sensor data is still being flushed and written into BigTable. There may be minor latency in showing current sensor data in the mobile app and other interfaces.
Posted Jun 19, 2019 - 23:18 PDT
Update
API latencies have gone down significantly and most services are back to normal. Still monitoring.
Posted Jun 19, 2019 - 22:44 PDT
Monitoring
CPU utilization of the hot node in the BigTable cluster is going down, and API latencies are going down as well. Still monitoring.
Posted Jun 19, 2019 - 22:34 PDT
Update
One of the nodes in the BigTable cluster is showing very high CPU utilization and seems to be the root cause of this issue. Working to mitigate the issue.
Posted Jun 19, 2019 - 22:33 PDT
Update
Components of the developer APIs and dashboard that fetch sensor data are showing the same high latency, because a central internal component they depend on is showing high latency.
Posted Jun 19, 2019 - 22:30 PDT
Identified
A potential cause for this issue has been identified. It is believed to be caused by the internal APIs for fetching time series data, which are causing the mobile app APIs to be unresponsive. Working to mitigate this issue.
Posted Jun 19, 2019 - 22:16 PDT
Investigating
Mobile app APIs have been unresponsive since around 9:10 PM PDT. Looking into this issue.
Posted Jun 19, 2019 - 22:13 PDT
This incident affected: Developer APIs (Dashboard Developer APIs), Enterprise Dashboard, and Mobile Apps (Awair Business App).